import numpy as np
import pandas as pd
from scipy.stats import norm
# plotting
import matplotlib.pyplot as plt
import seaborn as sns
# sklearn
from sklearn.naive_bayes import CategoricalNB # import Naive Bayes Classifier for categorial distributed features
from sklearn.naive_bayes import GaussianNB # import Naive Bayes Classifier for normal distributed features
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_supportExercise 4 – Classification II
$$
% math spaces % N, naturals % Z, integers % Q, rationals % R, reals % C, complex % C, space of continuous functions % machine numbers % maximum error % counting / finite sets % set 0, 1 % set -1, 1 % unit interval % basic math stuff % x tilde % argmax % argmin % argmax with limits % argmin with limits% sign, signum % I, indicator % O, order % partial derivative % floor % ceiling % sums and products % summation from i=1 to n % summation from i=1 to m % summation from j=1 to p % summation from j=1 to p % summation from i=1 to k % summation from k=1 to g % summation from j=1 to g % mean from i=1 to n % mean from i=1 to n % mean from k=1 to g % product from i=1 to n % product from k=1 to g % product from j=1 to p % linear algebra % 1, unitvector % 0-vector % I, identity % diag, diagonal % tr, trace % span % <.,.>, scalarproduct % short pmatrix command % matrix A % error term for vectors % basic probability + stats % P, probability % E, expectation % Var, variance % Cov, covariance % Corr, correlation % N of the normal distribution % dist with i.i.d superscript
% … is distributed as …
% X, input space % Y, output space % set from 1 to n % set from 1 to p % set from 1 to g % P_xy % E_xy: Expectation over random variables xy % vector x (bold) % vector x-tilde (bold) % vector y (bold) % observation (x, y) % (x1, …, xp) % Design matrix % The set of all datasets % The set of all datasets of size n % D, data % D_n, data of size n % D_train, training set % D_test, test set % (x^i, y^i), i-th observation % {(x1,y1)), …, (xn,yn)}, data % Def. of the set of all datasets of size n % Def. of the set of all datasets % {x1, …, xn}, input data % {y1, …, yn}, input data % (y1, …, yn), vector of outcomes % x^i, i-th observed value of x% y^i, i-th observed value of y % (x1^i, …, xp^i), i-th observation vector % x_j, j-th feature % (x^1_j, …, x^n_j), j-th feature vector % Basis transformation function phi % Basis transformation of xi: phi^i := phi(xi)
%%%%%% ml - models general % lambda vector, hyperconfiguration vector % Lambda, space of all hpos % Inducer / Inducing algorithm % Set of all datasets times the hyperparameter space % Set of all datasets times the hyperparameter space % Inducer / Inducing algorithm % Inducer, inducing algorithm, learning algorithm
% continuous prediction function f % True underlying function (if a statistical model is assumed) % True underlying function (if a statistical model is assumed) % f(x), continuous prediction function % f with domain and co-domain % hypothesis space where f is from % Bayes-optimal model % Bayes-optimal model% f_j(x), discriminant component function % f hat, estimated prediction function % fhat(x) % f(x | theta) % f(x^(i)) % f(x^(i)) % f(x^(i) | theta) % fhat_D, estimate of f based on D % fhat_Dtrain, estimate of f based on D %model learned on Dn with hp lambda %model learned on D with hp lambda %model learned on Dn with optimal hp lambda %model learned on D with optimal hp lambda
% discrete prediction function h % h(x), discrete prediction function % h hat % hhat(x) % h(x | theta) % h(x^(i)) % h(x^(i) | theta) % Bayes-optimal classification model % Bayes-optimal classification model
% yhat % yhat for prediction of target % yhat^(i) for prediction of ith targiet
% theta % theta hat % theta vector % theta vector hat %% %theta learned on Dn with hp lambda %theta learned on D with hp lambda % min problem theta % argmin theta
% densities + probabilities % pdf of x % p % p(x) % pi(x|theta), pdf of x given theta % pi(x^i|theta), pdf of x given theta % pi(x^i), pdf of i-th x
% pdf of (x, y) % p(x, y) % p(x, y | theta) % p(x^(i), y^(i) | theta)
% pdf of x given y % p(x | y = k) % log p(x | y = k)% p(x^i | y = k)
% prior probabilities % pi_k, prior% log pi_k, log of the prior % Prior probability of parameter theta
% posterior probabilities % P(y = 1 | x), post. prob for y=1 % P(y = k | y), post. prob for y=k % pi with domain and co-domain % Bayes-optimal classification model % Bayes-optimal classification model % pi(x), P(y = 1 | x) % pi, bold, as vector % pi_k(x), P(y = k | x) % pi_k(x | theta), P(y = k | x, theta) % pi(x) hat, P(y = 1 | x) hat % pi_k(x) hat, P(y = k | x) hat % pi(x^(i)) with hat% pi_k(x^(i)) with hat % p(y | x, theta) % p(y^i |x^i, theta) % log p(y | x, theta) % log p(y^i |x^i, theta)
% probababilistic% Bayes rule % mean vector of class-k Gaussian (discr analysis)
% residual and margin % residual, stochastic % epsilon^i, residual, stochastic % residual, estimated % y f(x), margin % y^i f(x^i), margin % estimated covariance matrix % estimated covariance matrix for the j-th class
% ml - loss, risk, likelihood % L(y, f), loss function % L(y, pi), loss function % L(y, f(x)), loss function % loss of observation % loss with f parameterized % loss of observation with f parameterized % loss of observation with f parameterized % loss in classification % loss in classification % loss of observation in classification % loss with pi parameterized % loss of observation with pi parameterized % L(y, h(x)), loss function on discrete classes % L(r), loss defined on residual (reg) / margin (classif) % L1 loss % L2 loss % Bernoulli loss for -1, +1 encoding % Bernoulli loss for 0, 1 encoding % cross-entropy loss % Brier score % R, risk % R(f), risk % risk def (expected loss) % R(theta), risk % R_emp, empirical risk w/o factor 1 / n % R_emp, empirical risk w/ factor 1 / n % R_emp(f) % R_emp(theta) % R_reg, regularized risk % R_reg(theta) % R_reg(f) % hat R_reg(theta) % hat R_emp(theta) % L, likelihood % L(theta), likelihood % L(theta|x), likelihood % l, log-likelihood % l(theta), log-likelihood % l(theta|x), log-likelihood % training error % test error % avg training error
% lm % linear model % OLS estimator in LM $$
Hint
If you work with Python, import the following:
Exercise 1: Naive Bayes
You are given the following table with the target variable Banana:
| ID | Color | Form | Origin | Banana |
|---|---|---|---|---|
| 1 | yellow | oblong | imported | yes |
| 2 | yellow | round | domestic | no |
| 3 | yellow | oblong | imported | no |
| 4 | brown | oblong | imported | yes |
| 5 | brown | round | domestic | no |
| 6 | green | round | imported | yes |
| 7 | green | oblong | domestic | no |
| 8 | red | round | imported | no |
We want to use a Naive Bayes classifier to predict whether a new fruit is a Banana or not. Estimate the posterior probability \(\hat\pi(\mathbf{x}_{\ast})\) for a new observation \(\mathbf{x}_{\ast} = (\text{yellow}, \text{round}, \text{imported})\). How would you classify the object?
Solution
When using the naive Bayes classifier, the features \(\mathbf{x}:= (x_\text{Color},x_\text{Form},x_\text{Origin})\) are assumed to be conditionally independent of each other, given the category \(y = k \in \{\text{yes}, \text{no}\}\), s.t.
\[ \mathbb{P}(\mathbf{x}~|~ y = k) = \mathbb{P}((x_\text{Color}, x_\text{Form}, x_\text{Origin}) ~|~ y = k) = \mathbb{P}(x_\text{Color} ~|~ y = k) \cdot \mathbb{P}(x_\text{Form} ~|~ y = k) \cdot \mathbb{P}(x_\text{Origin} ~|~ y = k). \]
Recall Bayes’ theorem:
\[ \pi_{k}(\mathbf{x})= \mathbb{P}(y = k ~|~ \mathbf{x})= \frac{\mathbb{P}(\mathbf{x}| y= k) \mathbb{P}(y= k)}{\mathbb{P}(\mathbf{x})} \]
As the denominator is constant across all classes, the following holds for the posterior probabilities:
\[ \pi_{k}(\mathbf{x})\propto \underbrace{\pi_{k}\cdot \mathbb{P}(x_\text{Color} ~|~ y = k) \cdot \mathbb{P}(x_\text{Form} ~|~ y = k) \cdot \mathbb{P}(x_\text{Origin} ~|~ y = k)}_{=: \alpha_k(x)} \\ \iff \exists c \in \mathbb{R}: \pi_{k}(\mathbf{x})= c \cdot \alpha_k(\mathbf{x}) \]
where \(\pi_{k}= \mathbb{P}(y = k)\) is the prior probability of class \(k\) and \(c\) is the normalizing constant.
From this and since the posterior probabilities need to sum up to 1, we know that
\[ 1 = ~ c \cdot \alpha_\text{yes}(\mathbf{x}) + c \cdot \alpha_\text{no}(\mathbf{x}) ~\iff c = \frac{1}{\alpha_\text{yes}(\mathbf{x}) + \alpha_\text{no}(\mathbf{x})}. \]
This means that, in order to compute \(\pi_\text{yes}(\mathbf{x})\), the scores \(\alpha_\text{yes}(\mathbf{x})\) and \(\alpha_\text{no}(\mathbf{x})\) are needed.
Now we want to estimate for a new fruit the posterior probability \(\hat{\pi}_{yes}((\text{yellow}, \text{round}, \text{imported}))\).
Obviously, we do not know the true prior probability and the true conditional densities. Here – since the target and the features are categorical – we use a categorical distribution, i.e., the simplest distribution over a \(g\)-way event that is fully specified by the individual probabilities for each class (which must of course sum to 1). This is a generalization of the Bernoulli distribution to the multi-class case. We can estimate the distribution parameters via the relative frequencies encountered in the data: \[ \hat{\alpha}_\text{yes}(\mathbf{x}_{\ast}) = \hat{\pi}_{\text{yes}} \cdot \hat{\mathbb{P}}(x_\text{Color} = \text{yellow} ~|~ y = \text{yes}) \cdot \hat{\mathbb{P}}(x_\text{Form} = \text{round} ~|~ y = \text{yes}) \cdot \hat{\mathbb{P}}(x_\text{Origin} = \text{imported} ~|~ y = \text{yes}) \\ = \frac{3}{8} \cdot \frac{1}{3} \cdot \frac{1}{3} \cdot 1 = \frac{1}{24} \approx 0.042 \]
\[ \hat{\alpha}_\text{no}(\mathbf{x}_{\ast}) = \hat{\pi}_{\text{no}} \cdot \hat{\mathbb{P}}(x_\text{Color} = \text{yellow} ~|~ y = \text{no})\cdot \hat{\mathbb{P}}(x_\text{Form} = \text{round} ~|~ y = \text{no}) \cdot \hat{\mathbb{P}}(x_\text{Origin} = \text{imported} ~|~ y = \text{no}) \\ = \frac{5}{8} \cdot \frac{2}{5} \cdot \frac{3}{5} \cdot \frac{2}{5} = \frac{3}{50} = 0.060 \]
At this stage we can already see that the predicted label is “no”, since \(\hat{\alpha}_\text{no}(\mathbf{x}_{\ast}) = 0.060 > \frac{1}{24} = \hat{\alpha}_\text{yes}(\mathbf{x}_{\ast})\) – that is, if we threshold at 0.5 for predicting “yes”.
With the above we can compute the posterior probability \[ \hat{\pi}_\text{yes}(\mathbf{x}_{\ast}) = \frac{\hat{\alpha}_\text{yes}( \mathbf{x}_{\ast})}{\hat{\alpha}_\text{yes}(\mathbf{x}_{\ast}) + \hat{\alpha}_\text{no}(\mathbf{x}_{\ast})} \approx 0.410 < 0.5 \]
and check our calculations against the corresponding results:
Assume you have an additional feature Length that measures the length in cm. Describe in 1-2 sentences how you would handle this numeric feature with Naive Bayes.
Solution
Before, we only had categorical features and could use the empirical frequencies as our parameters in a categorical distribution. For the distribution of a numerical feature, given the the category, we need to define a probability distribution with continuous support. A popular choice is to use Gaussian distributions. For example, for the information \(x_\text{Length}\) we could assume that
\[ \mathbb{P}(x_\text{Length} ~|~ y = \text{yes}) \sim \mathcal{N}(\mu_\text{yes}, \sigma^2_\text{yes})\] and \[\mathbb{P}(x_\text{Length} ~|~ y = \text{no}) \sim \mathcal{N}(\mu_\text{no}, \sigma^2_\text{no}) \]
In order to fully specify these normal distributions we need to estimate their parameters \(\mu_\text{yes}, \mu_\text{no}, \sigma^2_\text{yes}, \sigma^2_\text{no}\) from the data via the usual estimators (empirical mean and empirical variance with bias correction).
Exercise 2: Discriminant Analysis
The above plot shows \(\mathcal{D}= \left( \left(\mathbf{x}^{(1)}, y^{(1)}\right), \ldots, \left(\mathbf{x}^{(n)}, y^{(n)}\right)\right)\), a data set with \(n = 200\) observations of a continuous target variable \(y\) and a continuous, 1-dimensional feature variable \(\mathbf{x}\). In the following, we aim at predicting \(y\) with a machine learning model that takes \(\mathbf{x}\) as input.
To prepare the data for classification, we categorize the target variable \(y\) in 3 classes and call the transformed target variable \(z\), as follows: \[ z^{(i)} = \begin{cases} 1, & y^{(i)} \in (-\infty, 2.5] \\ 2, & y^{(i)} \in (2.5, 3.5] \\ 3, & y^{(i)} \in (3.5, \infty) \end{cases} \]
Now we can apply quadratic discriminant analysis (QDA):
- Estimate the class means \(\mu_k = \mathbb{E}(\mathbf{x}|z = k)\) for each of the three classes \(k \in \{1, 2, 3\}\) visually from the plot. Do not overcomplicate this, a rough estimate is sufficient here.
Solution
As the data seem to be pretty symmetric conditional on the respective class, we estimate the class means to lie roughly in the middle of the data clusters: \(\hat \mu_1 = 1\), \(\hat \mu_2 = 7\), \(\hat \mu_3 = 4\).- Make a plot that visualizes the different estimated densities per class.
Solution
We see that the variances in classes 1 and 2 are similar and also much smaller than in class 3. Therefore, the densities could look roughly like this:
- How would your plot from ii) change if we used linear discriminant analysis (LDA) instead of QDA? Explain your answer.
Solution
Since LDA assumes constant variances across all classes (also if this does not reflect the data situation), all densities would have the same shape and only differ in location.- Why is QDA preferable over LDA for this data?
Solution
As we have already noted, the assumption of equal class variances is not justified here, but LDA is confined to equivariant distributions. Therefore, the more flexible QDA is preferable in this case. Note, however, that the Gaussian distributions both variants of discriminant analysis use might not be perfectly appropriate, as the data seems to be more uniformly distributed (conditional on the classes).Given are two new observations \(\mathbf{x}_{*1} = -10\) and \(\mathbf{x}_{*2} = 7\). State the prediction for QDA and explain how you arrive there.
Solution
The prediction for \(\mathbf{x}_{\ast 1}\) will probably be \(\hat z_{\ast 1} = 3\) because the density of class 3 has much larger variance and will therefore overshoot the density of class 1. For \(\mathbf{x}_{\ast 2}\) the case is clear and we have \(\hat z_{\ast 2} = 2\).Exercise 3: Decision Boundaries for Classification Learners
We will now visualize how well different learners classify the three-class mlbench::mlbench.cassini data set.
- Generate 1000 points from
cassiniusing R or importcassini_data.csvin Python. - Then, perturb the
x.2dimension with Gaussian noise (mean 0, standard deviation 0.5), and consider the classifiers already introduced in the lecture:- LDA (Linear Discriminant Analysis),
- QDA (Quadratic Discriminant Analysis), and
- Naive Bayes.
Plot the learners’ decision boundaries. Can you spot differences in separation ability?
(Note that logistic regression cannot handle more than two classes and is therefore not listed here.)
Solution
First, visualize the simulated cassini data:
Then, define the learners (and the task for R ) and train the models:
Lastly, plot their decision boundaries:
We see how LDA, with its confinement to linear decision boundaries, is not able to classify the data very well. QDA and NB, on the other hand, get the shape of the boundaries right. It also becomes obvious that NB is a quadratic classifier just like QDA - their decision surfaces look pretty much alike.